[CELEBORN-2327] Add active-slot weight to load-aware placement #3685
sunchao wants to merge 3 commits into
Pull request overview
Adds a new optional celeborn.master.slot.assign.loadAware.activeSlotsWeight config (default 0.0) so load-aware slot placement can also account for each disk's currently-reserved active slots when sorting candidate disks, helping reduce placement skew under overlapping shuffle-heavy workloads.
Changes:
- Define and thread the new `activeSlotsWeight` config from `CelebornConf` through `Master.scala` into `SlotsAllocator.offerSlotsLoadAware` / `placeDisksToGroups`, where it is added to the existing flush/fetch-time sort key.
- Document the new tuning knob in `docs/configuration/master.md` and update `docs/developers/slotsallocation.md` to reflect the updated ordering formula.
- Add a regression test in `SlotsAllocatorSuiteJ` that verifies, with `activeSlotsWeight=1` and flush/fetch weights 0, the lower-active-slot worker is preferred over an otherwise equivalent overloaded worker.
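
The ordering change described in this list can be sketched roughly as follows. This is a minimal illustration, not the code in `SlotsAllocator.java`: the `DiskLoad` class, its field names, and the `order` helper are hypothetical, and the weighted-sum form of the existing flush/fetch terms is an assumption. Only the `activeSlots * activeSlotsWeight` term is taken from the PR text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; not the actual SlotsAllocator implementation.
final class LoadAwareSortKeySketch {

  // Hypothetical per-disk snapshot; names are assumptions for this example.
  static final class DiskLoad {
    final String name;
    final double avgFlushTime;
    final double avgFetchTime;
    final long activeSlots;

    DiskLoad(String name, double avgFlushTime, double avgFetchTime, long activeSlots) {
      this.name = name;
      this.avgFlushTime = avgFlushTime;
      this.avgFetchTime = avgFetchTime;
      this.activeSlots = activeSlots;
    }
  }

  // A lower key sorts first, i.e. the disk is preferred.
  // The flush/fetch terms are an assumed form of the existing key;
  // the last term is the one this PR describes adding.
  static double sortKey(
      DiskLoad d, double flushTimeWeight, double fetchTimeWeight, double activeSlotsWeight) {
    return d.avgFlushTime * flushTimeWeight
        + d.avgFetchTime * fetchTimeWeight
        + d.activeSlots * activeSlotsWeight;
  }

  // Returns a sorted copy; the sort is stable, so ties keep their input order.
  static List<DiskLoad> order(
      List<DiskLoad> disks, double flushTimeWeight, double fetchTimeWeight,
      double activeSlotsWeight) {
    List<DiskLoad> sorted = new ArrayList<>(disks);
    sorted.sort(Comparator.comparingDouble(
        d -> sortKey(d, flushTimeWeight, fetchTimeWeight, activeSlotsWeight)));
    return sorted;
  }

  public static void main(String[] args) {
    List<DiskLoad> disks = new ArrayList<>();
    disks.add(new DiskLoad("busy", 10.0, 10.0, 500));
    disks.add(new DiskLoad("idle", 10.0, 10.0, 5));
    // Mirrors the regression scenario: activeSlotsWeight=1, flush/fetch weights 0.
    System.out.println(order(disks, 0.0, 0.0, 1.0).get(0).name); // prints idle
  }
}
```

With `activeSlotsWeight` at its default of 0.0 the extra term vanishes, so in this sketch the ordering falls back to the timing-only key.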
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala | Adds MASTER_SLOT_ASSIGN_LOADAWARE_ACTIVE_SLOTS_WEIGHT config entry and accessor. |
| master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala | Reads new config and passes it into offerSlotsLoadAware. |
| master/src/main/java/org/apache/celeborn/service/deploy/master/SlotsAllocator.java | Adds activeSlotsWeight parameter and incorporates activeSlots * activeSlotsWeight into disk comparator. |
| master/src/test/java/org/apache/celeborn/service/deploy/master/SlotsAllocatorSuiteJ.java | Adds regression test and threads new parameter through existing helper call sites. |
| docs/configuration/master.md | Documents new config row. |
| docs/developers/slotsallocation.md | Updates ordering formula description. |
SteNicholas left a comment
LGTM. Thanks for the contribution.
@sunchao, please use
@SteNicholas Updated
Thanks. Merged to main (v0.7.0).
Thanks @SteNicholas @pan3793 for the review!
Why are the changes needed?
Celeborn's load-aware slot placement currently orders candidate disks by flush and fetch timing only. As a result, it can keep assigning new partitions to disks that already hold many reserved active slots, which worsens placement skew under overlapping shuffle-heavy workloads. CELEBORN-2327 tracks this gap.
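
To make the skew concrete, here is a toy simulation of the effect. It is a deliberate simplification: the greedy one-slot-at-a-time loop, the `assign` helper, and the single precomputed timing key per disk are all assumptions for illustration, and the real allocator's grouping logic is not modeled.

```java
import java.util.Arrays;

// Toy model of placement skew; not the actual Celeborn allocation algorithm.
final class PlacementSkewSketch {

  // Assigns `slots` one at a time to the disk with the smallest key
  // (timingKey + activeSlots * activeSlotsWeight) and returns the final
  // active-slot count per disk. Ties go to the lowest index.
  static long[] assign(
      double[] timingKey, long[] initialActiveSlots, int slots, double activeSlotsWeight) {
    long[] active = initialActiveSlots.clone();
    for (int s = 0; s < slots; s++) {
      int best = 0;
      for (int i = 1; i < active.length; i++) {
        double candidate = timingKey[i] + active[i] * activeSlotsWeight;
        double current = timingKey[best] + active[best] * activeSlotsWeight;
        if (candidate < current) {
          best = i;
        }
      }
      active[best]++;
    }
    return active;
  }

  public static void main(String[] args) {
    double[] timing = {5.0, 5.0};  // identical flush/fetch behaviour
    long[] initial = {100, 0};     // disk 0 already carries 100 active slots

    // Timing-only ordering cannot tell the disks apart, so disk 0 keeps winning.
    System.out.println(Arrays.toString(assign(timing, initial, 10, 0.0))); // prints [110, 0]

    // With an active-slot weight, the new slots go to the idle disk.
    System.out.println(Arrays.toString(assign(timing, initial, 10, 1.0))); // prints [100, 10]
  }
}
```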
What changes were proposed in this PR?
- Add the `celeborn.master.slot.assign.loadAware.activeSlotsWeight` config.
- Include `activeSlots * activeSlotsWeight` in load-aware disk ordering.

How was this PR tested?
- `UPDATE=1 build/mvn clean test -pl common -am -Dtest=none -DwildcardSuites=org.apache.celeborn.ConfigurationSuite`
- `./build/mvn -pl master -am -Dtest=SlotsAllocatorSuiteJ -DwildcardSuites=org.apache.celeborn.NoSuchSuite -DfailIfNoTests=false test`
- `./build/mvn -pl master -am -DskipTests test-compile`